ABSTRACT



This work revisits ab-initio methods to identify natively unfolded
proteins. Single predictors and combined score indexes are considered, and
their performance is critically evaluated against other methods already
present in the literature. We consider mean packing (<P>), mean contact
energy (<Ec>) and a new index of folding status, gVSL2, based on VSL2,
a predictor of single disordered amino acids. We use a new dataset made
of 743 folded proteins and 81 natively unfolded proteins. Used individually,
these predictors perform comparably to, or even better than, other
proposed methods: gVSL2 reaches a sensitivity (Sn) of 0.81, a specificity
(Sp) of 0.89 and a level of false predictions (fp) of 0.11. The performance
of these single predictors improves significantly when they are used in
combination. We introduce a strictly unanimous combination score Ssu and a
new score So, combining 10 dichotomic predictors. The former score leaves
some sequences undecided, whereas the latter classifies all the sequences
in a dataset without exception. Through the combined use of both scores we
get Sn = 0.79, Sp = 0.94 and fp = 0.06, with less than 6% of proteins left
unpredicted. The combined use of Ssu and So, applied to the problem of
finding the frequency of occurrence of natively unfolded proteins in
genomes from Nature's three kingdoms, gives the following figures: the
percentages of predicted natively unfolded proteins are 4.1% for Bacteria,
1.0% for Archaea and 20.0% for Eukarya; comparable, but not coincident,
with similar previous determinations. Evidence is given of a scaling law
relating the number of natively unfolded proteins to the total number of
proteins in a genome; a first estimate of the critical exponent is
1.95 ± 0.21.
INTRODUCTION



In the past few years it has been discovered that several proteins, in
physiological conditions, lack a well-defined tertiary structure, existing
as an ensemble of flexible conformations. These proteins, denoted in the
literature as natively unfolded or intrinsically disordered, are
characterized, microscopically, by a high atomic diffusivity all along
their sequence. Nevertheless they are involved in important cellular
functions, like signalling, targeting or DNA binding [1, 2, 3, 4, 5, 6, 7];
their existence clearly shifts the structure-function paradigm, which
regards the tertiary structure of a protein as necessary for its biological
function [8]. It has been suggested that natively unfolded proteins may
also play critical roles in the development of cancer [9]; moreover, the
absence of a rigid structure allows them to bind different targets with
high specificity and low affinity, suggesting that they are hubs in protein
interaction networks [10, 11, 12]. It is worth noting that unstructured
regions may also be present in folded proteins, conferring a high
flexibility on them. A specific local flexibility in these partially
unfolded proteins might play a dynamical role in modulating their
interactions with other macromolecules.

In this work we investigate sequence-only, ab-initio methods to identify
natively unfolded proteins. Computational approaches aimed at identifying
unstructured regions in proteins are very useful, since the experimental
characterization of these regions is somewhat ambiguous: the several
techniques available often give conflicting views on the same protein
[13, 14]. In particular, predictors of natively unfolded proteins may be
useful to quickly screen datasets of amino acid sequences, looking for
those that have a high tendency to remain unfolded; this is the main
application that we have in mind in this work.

On one hand, several methods have been proposed to predict unstruc- 
tured segments in proteins [T^ [T^ [TZl [T51 [T^ 1211 1221 IIS IIS - 
These methods aim at identifying disordered amino acids, i.e. residues for
which it is hard to determine experimentally, using X-ray crystallography
or NMR spectroscopy, the average positions of their atoms [29]. Predictors
of disordered amino acids are useful to find unstructured regions in
partially unfolded proteins, but they do not immediately show whether a
protein globally folds or not. Besides, unfolded segments may have a wealth
of different static and dynamic properties, but each predictor generally
focuses on just one specific characteristic; it therefore seems wise to
combine the information from different indicators to obtain robust
predictions [30, 31]. On the other hand, other methods have been proposed
to predict whether a protein is natively unfolded, but the literature on
the subject appears confused. Several physico-chemical properties have been
recognized as useful indicators, but they have been used differently by
different authors [32, 33, 28, 26, 34, 35]; moreover the proposed methods
have been tested on various datasets, so it is not easy to make
comparisons in search of an optimal approach.

Within the present study we revisited and optimized various methods
of predicting natively unfolded proteins, and we proposed a few synthetic
predictors or indexes of fold. Two indexes were based on mean packing [28]
and mean contact energy [36, 26]. A third index was derived from VSL2
[37, 27], a predictor of disordered amino acids that performed excellently
in the recent CASP7 experiment [29]. Our methods discriminated folded from
natively unfolded proteins with sensitivity up to 0.81 and a level of false
predictions not above 0.11 (see table 1), a good performance with respect
to other predictors proposed in the literature. To further improve the
performance of our indexes we combined them into scoring schemes. We
introduced a strictly unanimous score Ssu that requires unanimous consensus
among the various indexes of fold to classify a protein in one of the two
folding classes; this score reached a sensitivity of 0.82 and a level of
false predictions of 0.05, and it left unclassified only about 10% of the
proteins in the test set. It is therefore a reasonably valid predictor of
folding status; moreover the unclassified sequences are worth investigating
per se, as instances of proteins without a well-defined folding signature.
We also introduced a less stringent score, called So, that requires
consensus among the majority of folding indexes. This score had the
advantage of classifying all proteins and of giving a quantitative estimate
of how definite a prediction was; it allowed a refinement of the results
obtained using the strictly unanimous score.

We applied our indexes to evaluate the frequency of natively unfolded
proteins present in various genomes, obtaining results consistent with those
reported by Ward et al. using DISOPRED2 [24]. Since our approach is
quite different from theirs, we think that it is a valid alternative, useful
as a complement to predictions based on predictors of disordered amino
acids, such as DISOPRED2. Finally, using our approach, we observed a
significant correlation between the number of predicted disordered proteins
and the number of proteins in genomes of Bacteria, Archaea and Eukarya, and
we determined a scaling law, of possible fundamental significance and to be
validated by further studies, in the search for a relationship between the
frequency of natively unfolded proteins and the complexity of an organism.

MATERIALS AND METHODS 

Datasets 

In this work we used as training set the list of proteins compiled by
Prilusky to test FoldIndex [38], a web-based server aimed at identifying
unstructured proteins. It includes 151 folded proteins and 39 proteins
reported in the literature as natively unfolded. Folded proteins have a
length between 50 and 200 amino acids, they do not contain prosthetic
groups or disulphide bridges and their structures have been determined by
X-ray crystallography.

We compiled our own test set starting from PDBSelect25, version October
2007 [39, 40], which contains 3694 proteins with sequence identity lower
than 25%. To avoid the introduction of poor models we excluded structures
with a resolution above 2 Å and an R-factor above 20%. We obtained a
list of 1015 folded proteins. From this list, we extracted a restricted
list of 743 fully ordered proteins that contain less than 5% of disordered
amino acids. To identify disordered amino acids we aligned the SEQRES
fields of each PDB file with its ATOM fields: residues that are present in
SEQRES but absent in ATOM were considered disordered. To compile a list of
natively unfolded proteins, we started from the DisProt database, version
3.6 [41, 42]. We extracted a list of 81 natively unfolded proteins with at
least 95% of disordered amino acids and sequence identity below 25%.

Mean packing 

The mean packing of a protein sequence is the arithmetic mean of the
packing values of each amino acid. We used the packing index introduced by
Galzitskaya et al. [28], based on the number of residues located within a
distance of 8 Å, averaged over a large dataset of structures. We considered
a sliding window of length 11 and we assigned its mean packing to the
central residue.
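As a minimal sketch of the computation just described, the following Python fragment averages per-residue packing values over a sliding window and then over the whole sequence. The function names and the `packing_scale` dictionary are hypothetical: the actual packing values of Galzitskaya et al. [28] are not reproduced here and must be supplied by the user. Truncating the window at the chain termini is an assumption, since the text does not say how the ends are handled.

```python
from typing import Dict, List, Sequence


def window_profile(values: Sequence[float], window: int = 11) -> List[float]:
    """Replace each value by the mean over a window centred on it.

    Assumption: near the termini the window is truncated to the residues
    that exist; the text does not specify the treatment of chain ends.
    """
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


def mean_packing(sequence: str, packing_scale: Dict[str, float],
                 window: int = 11) -> float:
    """Arithmetic mean of the window-averaged packing values of a sequence.

    `packing_scale` maps one-letter amino acid codes to packing values;
    it is a placeholder for the published scale.
    """
    per_residue = [packing_scale[aa] for aa in sequence]
    profile = window_profile(per_residue, window)
    return sum(profile) / len(profile)
```

A sequence would then be compared with the discriminative threshold discussed next; natively unfolded proteins are expected on the low-packing side.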

To set the stage we initially computed mean packings on Prilusky's set
[38]: we looked for a discriminative threshold so as to obtain a
sensitivity of at least 0.80 and a level of false predictions as low as
possible; we found it at 20.55, getting a sensitivity of 0.82 and a level
of false predictions of 0.13. We repeated the experiment with sliding
windows of different lengths, without improving the performance.
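The threshold search can be illustrated with a short Python sketch. It scans candidate thresholds (here, midpoints between consecutive observed scores, which is our choice and not necessarily the one used in this work) and keeps the one with the lowest level of false predictions among those reaching the required sensitivity. All names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def rates(unfolded: Sequence[float], folded: Sequence[float],
          threshold: float, unfolded_is_low: bool) -> Tuple[float, float]:
    """Return (sensitivity, level of false predictions) for one threshold."""
    def predicted_unfolded(score: float) -> bool:
        return score < threshold if unfolded_is_low else score > threshold

    tp = sum(predicted_unfolded(s) for s in unfolded)  # unfolded found
    fp = sum(predicted_unfolded(s) for s in folded)    # folded mislabelled
    return tp / len(unfolded), fp / len(folded)


def best_threshold(unfolded: Sequence[float], folded: Sequence[float],
                   unfolded_is_low: bool = True,
                   min_sensitivity: float = 0.80
                   ) -> Optional[Tuple[float, float, float]]:
    """Threshold with the fewest false predictions at sensitivity >= minimum.

    Returns (threshold, sensitivity, false predictions), or None if no
    candidate reaches the required sensitivity.
    """
    values = sorted(set(unfolded) | set(folded))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        sn, fp = rates(unfolded, folded, t, unfolded_is_low)
        if sn >= min_sensitivity and (best is None or fp < best[2]):
            best = (t, sn, fp)
    return best
```

For mean packing, `unfolded_is_low=True` applies (natively unfolded proteins have low packing); for an index that is higher in unfolded proteins the flag would be set to `False`.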

Mean contact energy 

We followed the method by Dosztanyi et al. [26]. The contact energy value
of an amino acid is a measure of its "contact interaction" with the amino
acids located from 2 to 100 positions apart, in both directions, along
the sequence. There are, of course, constraints due to the length of the
sequence that should be taken into account in the bookkeeping. The contact
energy of amino acid i at position p is given by:

    e_i^p = \sum_{j=1}^{20} P_{ij} n_j^p

where n_j^p is the frequency of amino acid j in a window of length up to
100 around position p, taking into account possible limitations on both
sides due to the length of the protein. The generic element P_{ij} of the
"energy predictor matrix" P expresses the expected contact interaction
energy between amino acid i and j.

Contact energy values are averaged over a window of 21 amino acids
and the average is assigned to the central residue at position p in the
sequence. Finally, the arithmetic mean of the contact energy values of all
the amino acids gives the global mean contact energy of the protein.
To discriminate between folded and natively unfolded proteins, we
computed the mean contact energy of Prilusky's set [38] and we looked for
a threshold so as to get a sensitivity of at least 0.80 and a level of
false predictions as low as possible. We found it at -0.37 arbitrary energy
units (a.e.u.), getting a sensitivity of 0.85 and a level of false
predictions of 0.14.
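The procedure of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: the matrix `P` is passed in as a nested dictionary (its published values are not reproduced here), the function names are ours, and windows are simply clipped at the sequence ends.

```python
from collections import Counter
from typing import Dict, List


def contact_energies(sequence: str, P: Dict[str, Dict[str, float]],
                     min_sep: int = 2, max_sep: int = 100) -> List[float]:
    """Per-residue contact energy: sum over j of P[i][j] * n_j^p."""
    n = len(sequence)
    energies = []
    for p, aa in enumerate(sequence):
        # Residues 2..100 positions away on either side, clipped at the ends.
        partners = [sequence[q]
                    for q in range(max(0, p - max_sep), min(n, p + max_sep + 1))
                    if abs(q - p) >= min_sep]
        if not partners:  # sequence too short to have any partner
            energies.append(0.0)
            continue
        counts = Counter(partners)
        energies.append(
            sum(P[aa][b] * c for b, c in counts.items()) / len(partners))
    return energies


def mean_contact_energy(sequence: str, P: Dict[str, Dict[str, float]],
                        smooth_window: int = 21) -> float:
    """Smooth over 21 residues, then average over the whole sequence."""
    raw = contact_energies(sequence, P)
    half = smooth_window // 2
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return sum(smoothed) / len(smoothed)
```

With the threshold quoted above, a sequence would be assigned to the natively unfolded class when `mean_contact_energy(...)` is above -0.37 a.e.u., since natively unfolded proteins have the higher mean contact energy.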

Index derived from VSL2 

VSL2 [37, 27] is a disorder predictor that assigns to each amino acid of a
protein sequence the probability that the amino acid is disordered,
estimated using a combination of support vector machines. The score from
VSL2 is normalized between 0 and 1 and an amino acid is considered
disordered if its value is above 0.5.

We used the arithmetic mean of these disorder scores, evaluated using
VSL2B and output windows of length 11, to discriminate folded from
unfolded proteins, and we call it the gVSL2 index. We classified a protein
as natively unfolded if gVSL2 was above 0.5.

Combination of two parameters into a single index of fold 

We plotted the values of the two parameters on a plane and we looked for
discriminative lines. In general there is an overlap region that prevents an
exact separation of the two groups of sequences. We identified the overlap
region as the narrowest vertical band containing points from both groups.
For all pairs of points inside the overlap area we traced a line and
evaluated its performance in separating the two groups of proteins; among
all the discriminative lines with sensitivity above 0.80, we chose the one
with the lowest level of false predictions. If the equation of a
discriminative line is:

    y = ax + b,

then the corresponding scalar index of fold was defined as:

    I = -\mathrm{sign}(\langle x_f \rangle - \langle x_{nf} \rangle) \cdot \mathrm{sign}(a) \cdot (y - ax - b)

where \langle x_f \rangle and \langle x_{nf} \rangle are, respectively, the
mean values of the index x for folded and natively unfolded proteins. The
defined index was positive for folded proteins and negative or 0 for
natively unfolded ones. If the slope a became very large our code looked
for discriminative lines parallel to the ordinate axis and the index was
defined as:

    I = \mathrm{sign}(\langle x_f \rangle - \langle x_{nf} \rangle) (x - x_{th})

where x = x_{th} was the optimum discriminative line.
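The two definitions of the index can be written down directly. The Python sketch below is illustrative only; the function names are ours, and the search over all pairs of points in the overlap band is left out for brevity.

```python
from typing import Tuple

Point = Tuple[float, float]


def sign(v: float) -> int:
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def line_through(p: Point, q: Point) -> Tuple[float, float]:
    """Slope a and intercept b of y = a*x + b through p and q (p[0] != q[0])."""
    a = (q[1] - p[1]) / (q[0] - p[0])
    return a, p[1] - a * p[0]


def fold_index(point: Point, a: float, b: float,
               mean_x_folded: float, mean_x_unfolded: float) -> float:
    """Signed distance-like index relative to the line y = a*x + b.

    Positive for points on the 'folded' side of the line.
    """
    x, y = point
    return (-sign(mean_x_folded - mean_x_unfolded) * sign(a)
            * (y - a * x - b))


def fold_index_vertical(x: float, x_th: float,
                        mean_x_folded: float, mean_x_unfolded: float) -> float:
    """Index for a discriminative line x = x_th parallel to the ordinate axis."""
    return sign(mean_x_folded - mean_x_unfolded) * (x - x_th)
```

The two sign factors make the index positive on the folded side whichever way the line slopes and whichever group has the larger mean of x.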

Definition of score indexes 
We combined mean packing, mean contact energy and gVSL2 to obtain
score indexes: Su, Sv and Ssu, and then So. Su and Sv have been previously
proposed by Oldfield et al. [33]. Su is a unanimous score: a protein
is classified as natively unfolded if all the folding indexes agree on
that, otherwise it is classified as folded. Sv, on the other hand, is a
voting score: a protein is classified as natively unfolded if at least one
index assigns it to that class. We proposed a third combination rule: we
classified a protein as folded only if all the indexes predicted it as
folded; conversely, we classified a protein as natively unfolded only if
all the indexes predicted it as natively unfolded. This rule left a protein
unclassified if there was disagreement between at least two indexes. We
call this score strictly unanimous, Ssu.

To obtain So, we increased the number of indexes; we took different
pairs of parameters, we plotted their values on planes and obtained an
index of fold, as explained in the previous section. We considered all the
combinations of the four indexes, Uversky's HQ [32], mean packing, mean
contact energy and gVSL2, to get 10 new indicators of folding status. We
combined them into a global score as follows: if an index predicted a
protein as folded, we incremented the score by 1; if the index predicted a
protein as unfolded, we decremented the score by 1. We excluded indexes
that were unable to discriminate folded from unfolded proteins of the
training set with a sensitivity of at least 0.75 and a level of false
predictions of at most 0.15. The score can assume a positive, negative or
null value. So classifies a protein as folded if its value is positive,
otherwise it classifies it as natively unfolded.
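The four combination rules can be summarized in a few lines of Python. This is a sketch with hypothetical function names; each index is represented by a boolean vote, `True` meaning that the index predicts the protein as natively unfolded.

```python
from typing import Optional, Sequence


def s_u(votes: Sequence[bool]) -> str:
    """Unanimous score: unfolded only if every index says unfolded."""
    return "unfolded" if all(votes) else "folded"


def s_v(votes: Sequence[bool]) -> str:
    """Voting score: unfolded if at least one index says unfolded."""
    return "unfolded" if any(votes) else "folded"


def s_su(votes: Sequence[bool]) -> Optional[str]:
    """Strictly unanimous score: None (unclassified) on any disagreement."""
    if all(votes):
        return "unfolded"
    if not any(votes):
        return "folded"
    return None


def s_o_value(votes: Sequence[bool]) -> int:
    """+1 for each index predicting folded, -1 for each predicting unfolded."""
    return sum(-1 if v else 1 for v in votes)


def s_o(votes: Sequence[bool]) -> str:
    """Folded if the score is positive, natively unfolded otherwise."""
    return "folded" if s_o_value(votes) > 0 else "unfolded"
```

For example, with three indexes voting `[True, False, True]`, Su gives "folded", Sv gives "unfolded" and Ssu leaves the protein unclassified.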

Parameters of performance 

To evaluate the performance of the predictors we used very common
indicators [29]:

    Sensitivity:        S_n = \frac{TP}{TP + FN} = \frac{TP}{N_{unfolded}},
    Specificity:        S_p = \frac{TN}{TN + FP} = \frac{TN}{N_{folded}},
    False predictions:  f_p = 1 - S_p = \frac{FP}{TN + FP},

where TP stands for True Positive, TN for True Negative, FP for False
Positive and FN for False Negative.
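A small Python helper makes these definitions concrete. The function name and the counts in the usage note are illustrative and are not taken from our datasets; we assume here that natively unfolded proteins are the positive class, as implied by the denominators above.

```python
from typing import Dict


def performance(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and level of false predictions from counts.

    tp, fn: natively unfolded proteins predicted unfolded / folded.
    tn, fp: folded proteins predicted folded / unfolded.
    """
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "fp": fp / (tn + fp),
    }
```

With the made-up counts `performance(tp=8, fn=2, tn=90, fp=10)`, the sensitivity is 8/10, the specificity 90/100 and the level of false predictions 10/100.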



RESULTS AND DISCUSSION 

Mean packing, mean contact energy and gVSL2 
We tested the performance of mean packing, mean contact energy and
gVSL2 on a test set made of 743 folded and 81 natively unfolded
proteins, and we compared our indexes with our implementation of the
method proposed by Uversky and co-workers [32], called here HQ. The results
are reported in table 1. As we can see, HQ has, relatively, the worst
performance. Mean packing and mean contact energy exhibit a quite similar
ability in discriminating the two groups of proteins, whereas gVSL2 has a
comparatively higher sensitivity, but also a higher level of false
predictions.

Mean packing and mean contact energy have been used previously to
predict whether a protein is natively unfolded or not. Mean packing has
been used by Galzitskaya and co-workers [28]. They used a sliding window
restricted to just one amino acid and a threshold at 20.73. Using their
setting on our own test set, the sensitivity rose from 0.74 to 0.83.
However, the level of false predictions also grew, from 0.07 to 0.19. This
suggests that, using the approach in [28], one could overestimate the
number of natively unfolded proteins present in the genome of a given
organism. As regards contact energy, Dosztanyi et al. consider amino acids
with contact energy value above -0.2 a.e.u. as disordered [26]; recently,
this threshold has been used to effectively discriminate folded proteins
from natively unfolded ones in a particular set of protein complexes [31].
Using the discriminative threshold of -0.2 a.e.u. on our test set,
sensitivity dropped from 0.74 to 0.54. This result suggests that the
effectiveness of a discriminative threshold, using single predictors,
strongly depends on the chosen test set.

Combination of indexes into unanimous and voting scores 

We explored the possibility of enhancing the performance of single indexes
of fold by combining them into several scoring schemes. We first analysed
unanimous and voting scores (see Materials and Methods). The results
of the predictions are reported in table 2. As we can see, Ssu has the
best performance. Comparing table 2 with table 1, we observe that Su has
lower sensitivity than mean packing, mean contact energy and gVSL2,
whereas Sv has higher sensitivity but also a higher level of false
predictions. We conclude that Su is less effective than Sv. On the other
hand Sv must be used with caution, since the higher level of false
predictions may lead to an overestimate of the number of natively
unfolded proteins in a given genome.

On the one hand, Ssu had a higher sensitivity and a lower level of false
predictions than all the single indexes. On the other hand, Ssu
left unclassified all proteins that mean packing, mean contact energy and
gVSL2 did not jointly predict in the same class, so it is useful only
if the percentage of these unclassified proteins is reasonably low. In our
set of 743 folded and 81 natively unfolded proteins, 80 sequences were left
unclassified, about 10% of all proteins, an encouraging result; of these 80
unclassified sequences, 15 were natively unfolded, corresponding to 19% of
all natively unfolded proteins in the test set; therefore Ssu may have a
selectivity bias towards folded proteins. Sequences left unclassified by Ssu
have properties compatible with both classes; in this twilight zone a single
index would definitely not be reliable, haphazardly forcing the assignment
of a protein to one or the other class. Ssu, then, is a conservative,
reliable index, which refrains from forcing a classification and is useful
to select amino acid sequences with a weak folding signature. These
leftover sequences could be an interesting category per se or, simply, a
group of proteins that exceed the discriminating power of the methods
investigated here.

Other scoring schemes 

The score So was introduced to search for a well-performing combination
score able to take a decision in all cases. It required consensus among the
majority of folding indexes to assign an amino acid sequence to a specific
class, so its value could be considered as a quantitative expression of how
typically a sequence was assigned to one class or the other: a higher
absolute score meant higher consensus among different folding indexes and
hence a more definite assignment. The performance of So, evaluated on our
test set, is reported in table 2 and is clearly lower than that of Ssu;
nonetheless the combined use of both indexes can help to reduce the number
of unclassified proteins. We applied So to the 80 proteins left
unclassified by Ssu, and we assigned to a folding class only those with
|So| > 6, as shown in the last row of table 2, denoted by Ssu/So. The
combined use of Ssu and So gives a sensitivity of 0.79 and a level of false
predictions of 0.06; of the 80 proteins left unclassified by Ssu, 46 are
still unclassified. With this combination of Ssu and So, it is possible to
effectively separate, in a genome, folded from unfolded proteins; moreover
the method filters out ambiguous proteins, which are worth further study.
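The two-step Ssu/So rule described above can be sketched as follows. The function name and argument layout are hypothetical: `su_votes` holds the boolean votes (`True` = natively unfolded) of the indexes entering Ssu, and `so_value` is the integer So score of the same protein.

```python
from typing import Optional, Sequence


def ssu_so(su_votes: Sequence[bool], so_value: int,
           min_abs_so: int = 6) -> Optional[str]:
    """Classify with Ssu first; fall back on So only when |So| > min_abs_so.

    Returns "folded", "unfolded", or None when the protein stays
    unclassified.
    """
    if all(su_votes):
        return "unfolded"
    if not any(su_votes):
        return "folded"
    # Ssu is undecided: accept the So verdict only if it is clear-cut.
    if abs(so_value) > min_abs_so:
        return "folded" if so_value > 0 else "unfolded"
    return None
```

Proteins for which the function returns `None` are those that remain in the twilight zone after both scores have been applied.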

Frequency of disorder in various genomes 

In an interesting paper [24] the classifier DISOPRED2 was used to
estimate the disorder frequency in 13 bacterial, 6 archaean and 5
eukaryotic genomes; an average of 4.2% of eubacterial, 2.0% of archaean and
33.0% of eukaryotic proteins were predicted to contain long disordered
regions, i.e. segments with at least 30 consecutive disordered amino acids
(see table 3). We analysed the same genomes, with the exception of Homo
sapiens, by means of the combination scores defined in the above sections
(see again table 3). We observe that So predicts about 5.2% of eubacterial,
1.7% of archaean and 22.0% of eukaryotic proteins as natively unfolded;
these percentages are compatible with those predicted using DISOPRED2. It
is worth noting that the percentages of natively unfolded proteins
predicted by Ssu are lower than those predicted by So; more precisely, they
are 3.7% for Bacteria, 0.8% for Archaea and 19.3% for Eukarya. The
application of Ssu/So, useful to further evaluate sequences left
unclassified by Ssu, gave a quite similar result. The results obtained with
our scores are correlated with those obtained by means of DISOPRED2 (see
figure 1), which is a predictor of disordered amino acids that analyses
local evolutionary properties of polypeptide chains. Our scores combine
different global indicators of folding status, based on the analysis of
four basic parameters. The coherence of the predictions obtained through
these two different approaches makes us confident in the reliability of our
predictions.

It has been suggested that natively unfolded proteins are involved 
in regulatory and signalling processes inside a cell [H El E]. The higher 
percentage of natively unfolded proteins in Eukarya has been related 
to: i) the presence of finely regulated degradation pathways that allow 
disordered proteins to escape recognition processes, strictly based on the 
structure- function paradigm pQ; and ii) the necessity of flexible proteins 
within complex regulatory and signalling networks, typical of eukaryotic 
organisms [3, 5]. In fact, it has been observed that, in protein
interaction networks, disorder is frequent in the hub proteins
[10, 11, 12]. In figure 2 we attempt to establish a scaling law: on the
basis of the genomes investigated here, we find that the number of natively
unfolded proteins detected by Ssu is proportional to the number of proteins
in the genome raised to the power 1.95 ± 0.21. Further studies are
necessary to confirm the validity of this scaling law, which is possibly
relevant for the general biology of genetic code translation but also in
the search for allometric relations between the frequency of disordered
proteins and the regulative complexity of the species. We are planning
further studies on this point.
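A power law of this kind is usually estimated by a straight-line fit in log-log coordinates. The Python sketch below shows an ordinary least-squares version; it is an assumption that this matches the fitting procedure behind figure 2, and the function name is ours. It returns only the exponent and the prefactor, not the uncertainty on the exponent.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(n_proteins: Sequence[float],
                  n_unfolded: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of n_unfolded = c * n_proteins ** k in log-log space.

    Returns (k, c). All inputs must be strictly positive.
    """
    xs = [math.log10(n) for n in n_proteins]
    ys = [math.log10(u) for u in n_unfolded]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope = exponent of the power law
    c = 10 ** (mean_y - k * mean_x)    # intercept gives the prefactor
    return k, c
```

To apply it to the genomes of table 3, the number of natively unfolded proteins of each genome can be obtained as the Ssu percentage times the number of proteins divided by 100.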



CONCLUSION 

Let us put in perspective the results obtained in this work. We observed
that natively unfolded proteins have, in general, a higher mean contact
energy than folded ones; we can relate this property to their difficulty in
reaching a stable configuration, corresponding to a relatively low free
energy. This also explains their tendency to have a low mean packing,
typical of extended conformations, corresponding to minima of the free
energy separated by low barriers, of the order of the physiological thermal
energy scale k_B T_phys. It has also been observed that natively unfolded
proteins have a lower mean hydrophobicity and a higher mean net charge
[32], and these two parameters have been used to discriminate between the
two groups of proteins [32, 38, 33]. As suggested by Uversky [32], natively
unfolded proteins do not fold because their hydrophobicity is insufficient,
in typical environments, to form the hydrophobic core necessary to nucleate
the folding process. It is interesting to observe that mean hydrophobicity
and mean contact energy are correlated (Pearson's correlation coefficient
equal to -0.74): high hydrophobicity stabilizes the structure and favors
the spontaneous search for a minimum free energy configuration. Of course
the stabilization of a protein tertiary structure is due not only to
hydrophobic forces, but also to other forces of different origin (van der
Waals, hydrogen bonding, excluded volume); nonetheless, the strong
correlation between hydrophobicity and contact energy supports the idea
that contact energy incorporates a strong contribution from hydrophobicity.

We checked that the indexes of fold we introduced here are invariant
under shuffling of the amino acids in the sequences (changes limited to a
few percent). This shuffling invariance of the indexes suggests some
considerations. There is quite a large consensus that the tertiary
structure of a protein is stabilized by hydrophobic effects and van der
Waals interactions, which are not very sensitive to the detailed geometry
of the fold; the latter is modulated by the strongly directional hydrogen
bonds and by steric hindrance between side chains. These latter
interactions should obey a fine dynamical network of geometric constraints.
We think that the shuffling-invariant folding indexes proposed up to now in
the literature and in the present work are able to capture information
related only to the geometry-independent forces, which are globally
correlated with a peculiar bias in the amino acid composition of the
sequence. To confirm this point, we studied the correlation between folding
indexes and the frequencies of amino acids in the protein sequences. To
this aim, we used the distinction proposed by Romero et al. in [18]. They
observed that natively unfolded proteins are depleted in order-promoting
residues: W, C, F, I, Y, V, L; and enriched in disorder-promoting residues:
M, A, R, Q, S, P, E. We studied the correlation between order- and
disorder-promoting amino acid frequencies and mean packing, mean contact
energy and gVSL2; the results are reported in table 4. We observe a high
correlation, especially between indexes of fold and the frequency of
order-promoting amino acids; this confirms that the indexes investigated
here are determined by the mere amino acid composition and not by other
more subtle effects due to a specific order or polarity of the sequences.
This fact points to an intrinsic limitation of the current approaches in
predicting natively unfolded proteins that deserves further study.
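The composition argument can be illustrated with a few lines of Python. The residue sets are those listed in the text; the function names and the example sequence are hypothetical. Any quantity that depends only on composition, such as these two frequencies, is exactly unchanged by shuffling, whereas window-averaged indexes can change slightly because of end effects.

```python
import random

ORDER_PROMOTING = set("WCFIYVL")      # residues listed in the text
DISORDER_PROMOTING = set("MARQSPE")   # residues listed in the text


def frequency(sequence: str, residues: set) -> float:
    """Fraction of positions in `sequence` occupied by the given residues."""
    return sum(aa in residues for aa in sequence) / len(sequence)


def shuffled(sequence: str, seed: int = 0) -> str:
    """Return a random permutation of the sequence (same composition)."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)
```

Comparing `frequency(seq, ORDER_PROMOTING)` with the same call on `shuffled(seq)` gives identical values for any sequence `seq`.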
TABLES

            Sn      Sp      fp
HQ          0.67    0.88    0.12
<P>         0.74    0.93    0.07
<Ec>        0.74    0.91    0.09
gVSL2       0.81    0.89    0.11

Table 1: Performance of single indexes of fold. Performance of: HQ,
mean packing, mean contact energy and gVSL2 in discriminating natively
unfolded proteins among those in test set. Sn, sensitivity; Sp, specificity;
fp, number of false predictions. See Methods for definitions.





            Sn      Sp      fp      n.c.    folded  unfolded
Su          0.67    0.95    0.05    -       736     88
Sv          0.85    0.87    0.13    -       656     168
Ssu         0.82    0.95    0.05    80      656     88
So          0.73    0.93    0.07    -       712     112
Ssu/So      0.79    0.94    0.06    46      681     97

Table 2: Performance of different combination scores. Performance
of the combination scores (see text) on the proteins of the test set. Sn,
sensitivity; Sp, specificity; fp, number of false predictions; n.c., number
of proteins left unclassified.
ORGANISM                  N.          DP2 (1)     So          Ssu         Ssu         Ssu/So
                          proteins    % l > 30    % unfolded  % unfolded  % n.c.      % unfolded

ARCHAEA
A. pernix                 1700        2.1         2.2         1.3         5.3         1.6
A. fulgidus               2418        0.9         1.7         0.8         5.0         0.9
Halobacterium sp. (2)     2622        5.0         24.4        16.2        30.8        16.5
M. jannaschii             1768        1.0         1.1         0.2         5.4         0.5
P. abyssi                 1898        1.4         1.3         0.5         5.1         0.7
T. volcanium              1491        1.0         2.1         1.1         4.5         1.3
(total / mean)            9275        2.0         1.7         0.8         5.1         1.0

BACTERIA
A. tumefaciens            5355        5.7         5.5         4.1         8.0         4.5
A. aeolicus VF5           1558        1.9         1.5         0.5         5.9         0.7
C. pneumoniae AR39        1085        4.8         5.8         4.1         9.0         4.7
C. tepidum TLS            2247        3.3         6.2         4.7         7.7         5.3
E. coli K12               4130        2.8         3.6         2.5         6.1         2.8
H. influenzae             1615        3.8         3.2         2.1         5.2         2.6
M. tuberculosis H37Rv     3989        7.0         10.1        7.4         11.6        7.9
N. meningitidis MC58      2063        4.5         6.0         4.4         8.3         4.7
S. typhi                  4756        2.7         4.2         2.9         6.8         3.2
S. aureus                 2618        4.5         6.6         5.5         6.9         5.9
Synechocystis PCC 6803    3569        4.7         4.2         3.2         6.4         3.5
T. maritima               1856        1.8         2.4         1.0         5.8         1.2
T. pallidum               1009        6.4         4.3         2.7         6.7         3.5
(total / mean)            35850       4.2         5.2         3.7         7.5         4.1

EUKARYA
A. thaliana               31708       33.8        19.6        17.5        14.6        18.0
C. elegans                22843       27.5        19.1        16.1        13.0        16.8
D. melanogaster           20046       36.6        29.8        26.5        14.4        27.5
S. cerevisiae             5880        31.2        19.8        17.0        14.2        17.8
(total / mean)            80477       33.0        22.0        19.3        14.1        20.0

Table 3: Frequency of natively unfolded proteins in various genomes (3).
Comparison among the percentage of proteins having disordered segments with
more than 30 consecutive amino acids as predicted by DISOPRED2 (DP2) and
the percentage of natively unfolded proteins predicted by the scores
defined in this work.

(1) From Ward J.J. et al., Prediction and functional analysis of native
disorder in proteins from the three kingdoms of life, J. Mol. Biol. 2004,
337, 635-645.

(2) Halobacterium sp. is an outlier, so we did not consider it in the
computation of the mean of disordered proteins in the Archaea.

(3) Genomes were downloaded from the ftp server of NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/
            fop     fdp
HQ          0.74    -0.60
<P>         0.91    -0.63
<Ec>        -0.85   0.57
gVSL2       -0.84   0.77

Table 4: Correlation among fold indexes and frequencies of order- (fop) and
disorder-promoting (fdp) amino acids



FIGURE CAPTIONS 

FIGURE 1: Frequency of natively unfolded proteins in genomes: 
correlation between combination scores and DISOPRED2. 

For each genome considered in table 3, the average frequency of natively
unfolded proteins, estimated with So, Ssu and Ssu/So, is plotted versus
the estimate made by Ward et al. using DISOPRED2 [24]. The correlation
coefficients are: 0.84 (So), 0.90 (Ssu) and 0M (Ssu/So).

FIGURE 2: Number of predicted natively unfolded proteins vs. 
total number of proteins in various genomes. 

Logarithmic plot of the number of natively unfolded proteins, predicted by
Ssu, vs. the total number of proteins in the genome. The exponent of the
power law is: 1.95 ± 0.21.